Dissecting financial markets: Sectors and states 



Matteo Marsili 

Abdus Salam International Center for Theoretical Physics, Strada Costiera 11, 34014 Trieste, Italy 

and 

Istituto Nazionale per la Fisica della Materia (INFM), Unità Trieste SISSA, Via Beirut 2-4, 34014 Trieste 

(February 1, 2008) 

By analyzing a large data set of daily returns with a data clustering technique, we identify economic 
sectors as clusters of assets with similar economic dynamics. The sector size distribution follows 
Zipf's law. Secondly, we find that patterns of daily market-wide economic activity cluster into 
classes that can be identified with market states. The distribution of frequencies of market states 
shows scale-free properties and the memory of the market state process extends to long times (~ 50 
days). Assets in the same sector behave similarly across states. We characterize market efficiency 
by analyzing the market's predictability and find that the market is indeed close to being efficient. We 
find evidence of a dynamic pattern following market crashes. 



I. INTRODUCTION 

Thanks to the availability of massive flows of financial 
data, theoretical insights on financial markets can nowadays 
be tested to a precision that is unprecedented in socio-economic 
systems. This poses a challenge which has attracted 
natural scientists, who have pioneered an empirical 
approach to financial fluctuations independent 
of the econometric approach and often in contrast with 
the axiomatic approach of theoretical finance. 

The empirical evidence depicts financial markets as 
complex self-organizing critical systems: The statistics 
of real market returns deviate considerably from the 
Olympic Gaussian world described by Louis Bachelier at 
the turn of the last century. Rather, Mandelbrot observed 
that fractal (Lévy) statistics give a closer approximation, 
even though that is not a satisfactory model either. 
Market returns display scaling; long range volatility 
correlations and evidence of multiscaling have also 
been discussed. Such features evoke the theory of critical 
phenomena in physics, which explains how quite similar 
features may emerge from the interaction of many microscopic 
degrees of freedom and statistical laws. Indeed 
financial markets are systems of many interacting degrees 
of freedom (the traders) and there are very good theoretical 
reasons to expect that they operate rather close 
to criticality. These expectations have been substantiated 
by microscopic agent based market models: 
The picture offered by these synthetic markets is one 
where speculation drives the market to information efficiency 
- i.e. to a point where market returns are unpredictable. 
But the point where markets become exactly efficient is 
the locus of a phase transition. Close to the phase transition 
the behavior of synthetic markets is characterized 
by the observed stylized facts - fat tails and long range 
correlations - whereas far from the critical region the 
market is well described in terms of random walks (see 
Ref. for a non technical discussion). 

Work has however been mostly confined to single assets 
or indices. Recently ensembles of assets and their 
correlations have become the focus of quite intense interest. 
On one side the role of random matrix theory 
has been realized as a tool for understanding how noise 
dresses financial correlations [12], how one can undress 
them [13], how clustering techniques can help in understanding 
the structure of correlations [14], and the impact 
of such considerations on portfolio optimization. 

Here we report findings that strongly support the 
view of a self-organized critical market. We show that 
long range correlations and scale invariance extend both 
across assets and, in the behavior of the ensemble of assets, 
across frequencies. More precisely, we apply a novel 
parameter free data clustering method to a large 
financial data set in order to uncover the internal 
structure of correlations both across different assets and 
across different days. We identify statistically significant 
classifications of assets in correlated sectors and of daily 
profiles of market-wide activity in market states. Both 
the statistics of sector sizes and of state sizes show scale 
free properties. 

Determining the market's states is an important achievement 
both theoretically and practically: The concept of 
a state which codifies all relevant economic information 
is the basis of many theoretical models of financial markets. 
But in practice traders experience every day a quite 
different reality: The market place is flooded with massive 
flows of information of which it may be hard to say 
what is relevant and what is irrelevant. It is by no means 
obvious that something like market states exists at all 
and even if they exist the problem becomes that of identifying 
them. Our aim is to give a practical answer to 
these questions. We shall keep our discussion as simple 
as possible, relegating technical details to notes and to 
the appendix. 

II. THE METHOD AND THE DATA SET 

The data clustering method that we use has been recently 
proposed in Ref. In brief, it is based on 
the simple statistical hypothesis that similar objects have 
something in common. It is possible to compute the likelihood 
that a given data set satisfies this hypothesis and 
hence to look for the most likely cluster structure. A 
precise definition is given in the appendix and for more 
details we refer the interested reader to Refs. [13], [16]. Let 
us only mention that this method overcomes several limitations 
of traditional data clustering approaches, such as 
the need to pre-define a metric, to fix a priori the 
number of clusters or to tune the value of other parameters. 

The data set covers a period from 1st January 1990 
to 30th of April 1999 and it reports daily prices (open, 
high, low, close) for 7679 assets traded in the New York 
Stock Exchange. The number of assets actually 
traded varies with time. Hence we mainly focus 
on a subset of the 2000 most actively traded assets 
(see http://www.sissa.it/dataclustering/fin/ 
for the detailed list of assets considered, as well as for 
further information). 

Our goal is to investigate the internal structure of correlations; 
hence we first normalize the raw data in order 
to eliminate common trends and patterns both across 
assets and across different days. This procedure eliminates 
for example the so-called "market mode", i.e. the 
constant correlation of individual assets' returns with the 
so-called "market's return". 
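
The exact normalization is relegated to a note and is not reproduced here. As a rough illustration only, the following Python sketch assumes one possible scheme: daily log-returns are alternately standardized across assets (for each day) and across days (for each asset). The function name `normalize_returns`, the iteration count and the use of closing prices are assumptions made for this example, not the procedure used in the paper.

```python
import numpy as np

def normalize_returns(prices, n_iter=20):
    """Assumed normalization (illustrative, not the paper's exact recipe).

    prices: array of shape (A, T + 1) with positive closing prices,
            one row per asset. Returns an (A, T) array x[i, t].
    """
    r = np.diff(np.log(prices), axis=1)  # daily log-returns, shape (A, T)
    x = r.copy()
    for _ in range(n_iter):
        # For each day, remove the mean and scale across assets
        # (this removes the common "market mode" of that day).
        x = (x - x.mean(axis=0, keepdims=True)) / x.std(axis=0, keepdims=True)
        # For each asset, remove the mean and scale across days.
        x = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
    return x
```

With this choice every asset ends up with exactly zero mean and unit variance over time, while the per-day normalization holds only approximately after a finite number of iterations.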



III. MARKET SECTORS: SCALE FREE 
MARKET STRUCTURE 

We first apply data clustering to group assets with 
similar economic dynamics in sectors of correlated assets 
(see appendix). This classification reveals a rich structure. 
The clusters giving the largest contributions to the 
log-likelihood clearly emerge from the noisy background 
in Fig. 1. We find a large overlap with the sectors of economic 
activity defined by the Standard Industrial Classification 
(SIC) codes (see caption of Fig. 1). But we also 
find significant correlations between assets with widely 
different SIC codes. This has practical relevance for risk management 
of large portfolios which cannot be handled all at 
once. Indeed rather than splitting the problem according 
to economic sectors (defined by the SIC) it is preferable 
to use our classification in correlated sectors. The difference 
between the two classifications is also revealed by a Zipf's 
plot of the size of a sector against its rank (see inset of 
Fig. 1). The distribution of correlated sector sizes follows 
Zipf's law to a high accuracy, i.e. the number $\mathcal{N}(n)$ 
of sectors with more than n firms (i.e. of size larger than 
n) is inversely proportional to n. Note that the scale 
free distribution of sector sizes is not due to an analogous 
property of fundamentals. Indeed the rank plot of 
economic sector sizes bends in log-log scale. This suggests 
that Zipf's law arises as a dynamical consequence 
of market interaction. 
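
A statement of the form "$\mathcal{N}(n)$ is inversely proportional to n" is equivalent to saying that the size of a sector is inversely proportional to its rank. The following minimal Python sketch shows one way such a rank plot could be checked, by a least-squares fit in log-log scale. The helper `zipf_slope` and the synthetic sizes are assumptions made for illustration; they are not the data or the fitting procedure of the paper.

```python
import numpy as np

def zipf_slope(sizes):
    """Least-squares slope of log(size) versus log(rank)."""
    s = np.sort(np.asarray(sizes, dtype=float))[::-1]  # largest sector first
    rank = np.arange(1, len(s) + 1)
    slope, _intercept = np.polyfit(np.log(rank), np.log(s), 1)
    return slope

# Synthetic sector sizes obeying size = 1000 / rank (hypothetical values).
sizes = [1000.0 / r for r in range(1, 51)]
print(zipf_slope(sizes))  # a slope close to -1 indicates Zipf's law
```

A rank plot that bends in log-log scale, as found for the SIC sectors, would show up as a poor linear fit rather than as a slope near -1.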



The scale invariant behavior is robust with respect to 
the subset of assets taken: The same behavior is found 
considering the 1000, 2000 or 4000 most actively traded 
assets in that period, or the 443 assets in the S&P500 index 
(see Ref.). In addition we find, as in Ref., that 
the correlation $c_s$ inside sector s (see appendix) scales 
with its size $n_s$ with a law $c_s \sim n_s^{\gamma}$ with $\gamma \approx 1.66$. 




FIG. 1. Dendrogram of the cluster structure of correlated 
sectors resulting from the hierarchical clustering algorithm. Assets 
are reported along the horizontal axis and red shapes 
correspond to clusters of correlated assets. The height of a 
shape is the contribution to the log-likelihood of the corresponding 
cluster of assets. See the appendix for more details. 
The cluster structure is statistically significant because the 
noise level corresponding to uncorrelated data would show 
structures with a log-likelihood of at most 0.1, three orders 
of magnitude smaller. The classification in sectors has a 
large overlap with economic sectors. For example, clusters 
1 and 2 contain firms in the electric sector and computers 
respectively. Cluster 4 is the sector of gold, 5 is composed 
of banks, 8 contains oil and gas firms, 9 petroleum. Clusters 
3, 6 and 7 are mixed clusters (more details are available 
at http://www.sissa.it/dataclustering/fin/). Inset: 
Distribution of correlated sector sizes for 2000 (•) and 4000 
(□) assets. The distribution of the size of economic sectors 
(o), as defined by the (first two digits of the) SIC codes, for the 
same 4000 assets is shown for comparison. The line (drawn 
as a guide to the eyes) has slope -1. 



We finally remark that this property is not an artifact 
of the method. Indeed the distribution of eigenvalues of 
the correlation matrix shows a similar broad distribution, 
even though that is affected by considerable noise dressing. 
A factor model which takes into account a large 
enough number of principal components (corresponding 
to the largest eigenvalues) reproduces the same features^. 



IV. MARKET STATES 



Are there well defined patterns of daily market-wide 
economic performance? In order to answer this question, 
rather than classifying assets according to their temporal 
evolution, we can classify days according to the performance 
of different assets. Fig. 2 implies that, above 
a noisy background, a meaningful classification of the 
daily profiles of market activity exists. Clusters of days 
can be identified with different patterns of market wide 
activity - or market states. Quite remarkably, the maximum 
likelihood classification in market states shows scale 
free features for large clusters (frequent patterns of market 
activity). The number of patterns which occur more 
than d days behaves as a power law $\mathcal{N}(d) \sim d^{-\alpha}$ for the most frequent 
patterns (inset top). There is a clear crossover in 
the plot of a cluster's correlation versus cluster size which 
distinguishes the meaningful clusters (patterns) from a 
random noise background (inset bottom). 




FIG. 2. Same plot as Fig. 1 for days: Clusters of days identify 
market states. We identify states (see labels) as groups 
of correlated clusters of days. Inset: Distribution of cluster 
sizes, i.e. of the frequency with which states occur (top) and 
correlation $c_s$ inside each cluster (bottom). 

From a sample of 2000 assets over T = 2358 days we 
identify 5 different states - characterized by similar profiles 
of market activity - plus a sixth random state (see 
Fig. 2). We assign an integer $\omega(t)$ between 1 and 6 to 
each day t, which is the state which occurred on that day. 



We are then in a position to analyze market performance 
in different states. Fig. 3 shows the (non normalized) 
average daily returns of different assets in different 
states. We find that the market's behaviors in states 1 and 2 
are anti-correlated: Those assets which go up in state 1 
go down in state 2, on average. Fig. 3 also shows that 
assets in the same sector as defined above have a similar 
behavior. So, for example, while most of the assets go 
up in state 1 and down in state 2, the cluster of assets of 
Gold and Silver mining has an opposite behavior. State 
3 is clearly characterized by a fall of High-tech companies 
and a mild rise in the electric sector. An opposite 
behavior takes place in state 4, whereas state 5 is dominated 
by a marked rise of Oil & Gas, and Petroleum 
refining companies. 
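
The state-conditional average returns discussed above can be computed directly once each day carries a state label. The following is a minimal Python sketch under assumed conventions: `returns` is a hypothetical array with one row per asset and one column per day, and `states` holds the integer labels $\omega(t)$ in 1..n_states. The function name is introduced for this example only.

```python
import numpy as np

def mean_return_by_state(returns, states, n_states):
    """Average return of each asset over the days spent in each state.

    returns: array (A, T); states: length-T integers in 1..n_states.
    Result[i, w - 1] is the mean of returns[i, t] over days with state w
    (NaN if state w never occurs).
    """
    returns = np.asarray(returns, dtype=float)
    states = np.asarray(states)
    out = np.full((returns.shape[0], n_states), np.nan)
    for w in range(1, n_states + 1):
        days = states == w
        if days.any():
            out[:, w - 1] = returns[:, days].mean(axis=1)
    return out
```

Plotting column w against column w' of the result, one point per asset, would give a scatter plot of the kind described for Fig. 3.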

These results are remarkably stable with respect to the 
definition of the time window where the analysis is performed. 

FIG. 3. Performance of the market in different states. Each 
asset i corresponds to a point whose coordinates are the average 
returns $(\langle r_i|\omega\rangle, \langle r_i|\omega'\rangle)$ of asset i in states $\omega$ and $\omega'$. 
Assets in different sectors are plotted differently. 



A. Predictability and market efficiency 

Clustering the market's dynamics leaves us with the sequence 
$\omega(t)$ of the states of the market in different days 
$t = 1, \ldots, T$. This allows us to pose interesting questions 
on predictability and the market's information efficiency. 



^In our case $\approx 30$ eigenvalues of the correlation matrix are 
significantly outside the noise band predicted by Random Matrix 
Theory. With a correlation matrix which retains the 
structure of the first $\approx 20$ principal components (considering 
the remaining components as uncorrelated noise) we found a 
quite similar cluster structure. 

Let us first ask: Is it possible to predict the state $\omega'$ 
of the market tomorrow, given the state $\omega$ of the market 
today? In order to answer this question we estimate the 
probability 

$$ P_1(\omega'|\omega) = \frac{\sum_{t=1}^{T-1} \delta_{\omega(t),\omega}\, \delta_{\omega(t+1),\omega'}}{\sum_{t=1}^{T-1} \delta_{\omega(t),\omega}} $$

of transition from state $\omega$ to state $\omega'$. It turns out that 
both the classification in states and the transition matrix 
$P_1(\omega'|\omega)$ are very stable with respect to the definition of 
the time window [19]. This means that they both vary 
very slowly in time. Hence we shall neglect their variation 
in time henceforth. 
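
The estimate of the transition probability is a simple counting exercise. The following Python sketch, with an assumed function name `transition_matrix` and an assumed storage convention `P[w - 1, w' - 1]` for the probability of going from state w to state w', counts the pairs of states separated by k days and normalizes each row. The optional lag `k` is an addition of this example; k = 1 corresponds to $P_1(\omega'|\omega)$.

```python
import numpy as np

def transition_matrix(states, n_states, k=1):
    """Empirical k-day transition probabilities from a sequence of states.

    states: integers in 1..n_states, one per day.
    Returns P with P[w - 1, w2 - 1] = fraction of days in state w that
    are followed, k days later, by state w2. Rows of states that never
    occur (as a starting state) are left at zero.
    """
    s = np.asarray(states) - 1
    counts = np.zeros((n_states, n_states))
    np.add.at(counts, (s[:-k], s[k:]), 1)       # count pairs (w(t), w(t+k))
    row = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row, out=np.zeros_like(counts), where=row > 0)
```

For the short sequence 1, 2, 1, 2, 1, 1 state 1 is followed once by state 1 and twice by state 2, so the first row of the matrix is (1/3, 2/3).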

If the process $\omega(t)$ were Markovian, its predictability 
could be quantified by the characteristic time $\tau$ of convergence 
to the stationary state. This is related to the 
second largest (in absolute value) eigenvalue $\lambda$ of the matrix 
$P_1(\omega'|\omega)$ by $\tau = -1/\log|\lambda|$. We find $\tau \approx 0.54$ days 
- a value which would occur by chance, if there were no 
correlations, in one out of 10'' cases^. Statistical predic- 
tion is possible. 

Can we predict the market's returns on the basis of these 
results? Fig. 3 shows that average returns $\langle r_i(t)|\omega(t)\rangle$ 
conditional on the state $\omega(t)$ of the market contain non-trivial 
information. However this information is not available 
for trading in day t. But if we know the transition 
matrix $P_1(\omega'|\omega)$ we can estimate the expected return of 
asset i tomorrow given the state $\omega$ today: 

$$ \langle r_i(t+1)|\omega(t)\rangle = \sum_{\omega'} \langle r_i(t+1)|\omega(t+1)=\omega'\rangle\, P_1(\omega'|\omega(t)). $$
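
In matrix form, the expected return for tomorrow is the product of the table of state-conditional average returns with the row of the transition matrix selected by today's state. A minimal Python sketch, with hypothetical array names and the assumed convention that `P[w - 1, w2 - 1]` stores $P_1(\omega_2|\omega)$:

```python
import numpy as np

def expected_next_return(cond_mean, P, state_today):
    """Expected return of every asset tomorrow, given today's state.

    cond_mean[i, w2 - 1]: average return of asset i in state w2.
    P[w - 1, w2 - 1]: probability of moving from state w to state w2.
    state_today: integer label of today's state (1-based).
    """
    cond_mean = np.asarray(cond_mean, dtype=float)
    P = np.asarray(P, dtype=float)
    # Sum over tomorrow's state w2 of <r_i | w2> * P(w2 | state_today).
    return cond_mean @ P[state_today - 1]
```

The result is a vector with one entry per asset; it relies on the transition matrix being stationary, which is the approximation adopted in the text.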



A natural measure of predictability, inspired by works on 
theoretical models [20], [21], is the averaged signal-to-noise 
ratio defined as: 



where $\delta r_i(t) = r_i(t) - \langle r_i \rangle$ and $\rho_\omega$ is the frequency with 
which state $\omega$ occurs. The distribution of $H_i$ across assets 
is shown in Fig. 4 for $t' = t$, $t' = t + 1$ and $t' = t + \infty$. 
The latter gives a benchmark of the background noise 
level. We find $H_i(t|t) \gg H_i(t+\infty|t)$ for several assets 
i: the knowledge of $\omega(t)$ before day t provides significant 
predictive power on excess returns. That same information 
is much less useful the day after, since $H_i(t+1|t)$ 
is only slightly above the noise level. This is a further 
indication that the financial market is close to information 
efficiency, but not quite unpredictable. In reality 
the transition matrix $P_1(\omega'|\omega)$ changes slowly in time. 
Hence this conclusion provides an "upper bound" for the 
market's predictability (when measured out-of-sample): 
Real markets are therefore even closer to efficiency. 



If $\omega(t)$ were a Markov process, the characteristic time 
$\tau_k$ for transitions $\omega(t) \to \omega(t+k)$ over k days^ should 
decrease with k as $\tau_k = \tau_1/k$. A prediction of the future 
state of the market, which is significantly better than a 
random draw, would only be possible on a time horizon 
of one day, if the process were Markovian. The inset 
of Fig. 4 shows that $\tau_k$ remains significantly above the 
noise level almost up to $k \approx 100$ days! This means that 
$\omega(t)$ carries significant information about the future state 
$\omega(t+k)$ of the market, even after $k \approx 50$ days. The slow 
decay of $\tau_k$ is a further signature of the presence of long 
range correlations. 
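
The Markov benchmark $\tau_k = \tau_1/k$ can be verified numerically on any transition matrix. The following Python sketch uses a hypothetical two-state matrix chosen only for illustration; `characteristic_time` is a name introduced here and implements $\tau = -1/\log|\lambda|$ with $\lambda$ the second largest eigenvalue in absolute value.

```python
import numpy as np

def characteristic_time(P):
    """tau = -1 / log|lambda|, lambda = second largest |eigenvalue| of P."""
    lam = np.sort(np.abs(np.linalg.eigvals(np.asarray(P, dtype=float))))[::-1]
    return -1.0 / np.log(lam[1])

# Hypothetical two-state Markov chain (eigenvalues 1 and 0.7).
P1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
tau_1 = characteristic_time(P1)
for k in (1, 2, 5):
    Pk = np.linalg.matrix_power(P1, k)   # k-day transitions of a Markov chain
    print(k, characteristic_time(Pk), tau_1 / k)
```

For an empirical sequence one would instead estimate the k-day transition matrix directly from the data; a $\tau_k$ that stays well above $\tau_1/k$ then signals memory beyond the Markov approximation.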

^This conclusion was reached considering the characteristic 
times $\tau$ for symbolic sequences $\omega(t)$ generated by randomly 
reshuffling days. These times are distributed around $\tau \approx 0.33$ 
with a spread $\delta\tau \approx 0.04$. The analysis of the tail of the distribution 
allows one to estimate the likelihood of $\tau \approx 0.54$ for the 
real sequence. 

^$\tau_k$ is computed in the same way as $\tau = \tau_1$ above, from 
the matrix $P_k(\omega'|\omega)$ of transition probabilities $\omega(t) = \omega \to 
\omega(t+k) = \omega'$ in k days. For a Markov process this matrix is 
the k-th power of the matrix $P_1(\omega'|\omega)$ and its eigenvalues are 
given by $\lambda_k = \lambda_1^k$. 

FIG. 4. Distribution of predictability 
$H_i(t'|t)$ for $t' = t$, $t + 1$ and $t + \infty$. The noise background 
predictability $H_i(t+\infty|t)$ is estimated drawing $\omega(t+\infty)$ at 
random from the population of states. Inset: Characteristic 
times $\tau_k$ for transitions over k days for the real sequence $\omega(t)$ 
(•), a random sequence (+) and a Markov chain sequence (o) 
generated with the transition probability $P_1(\omega'|\omega)$ estimated 
from $\omega(t)$. The random sequence (+) represents the noise 
background. For a Markov chain (o) $\tau_k$ is significantly above 
the noise level only for k = 1. For the real market process 
$\tau_k$ is well above the noise level up to $k \approx 50$. 

During the period we have studied, two major extreme 
events occurred: the 27 October 1997 and the 31 
August 1998 crashes. The state process $\omega(t)$ is different 
before the crash, but is quite similar after it. The 
strings of states, starting from the day of the crash, read 
2136613611 ... and 2126614633 ... in the two cases. This 
is a significant similarity^. This suggests the existence 
of a particular dynamical pattern with which markets 
respond to extreme events (see also Ref. [22] on this). 



V. CONCLUSION AND OUTLOOK 

In conclusion we show that both the horizontal clustering 
of assets in correlated sectors and the vertical classification 
of market-wide economic performance in market 
states reveal a scale free structure (see Figs. 1, 2). The 
emergent picture poses quite severe constraints on multi-asset 
agent based modeling, which we believe will disclose 
important information on how real markets work. This 
expectation is based on the fact that scale-free statistical 
behavior is a signature of interaction mechanisms which 
is rather insensitive to microscopic details. 

Furthermore, the identification of market states allows 
us to precisely quantify informational efficiency by com- 
puting the market's predictability, thereby establishing a 
direct contact between the empirical world and the realm 
of theoretical models. In particular we find that, as ex- 
pected, markets are close to information efficiency. 

We find that correlated sectors have a large overlap 
with sectors of economic activity. In the same way, it 
would be interesting to understand how states are cor- 
related with economic information and the news arrival 
process. 

In a wider context, we have discussed an unsupervised 
approach to the study of a complex system. Be it a stock 
market, the world economy, an urban traffic network, a cell 
of a living organism or the immune system, the complex 
system can be considered as a black box. We show 
how a series of simultaneous measurements in many different 
"points" of the system allows one to identify its parts and 
its states. 

A black box approach to a financial market or to a 
cell, which neglects all of economics and finance or of 
biology and genetics and relies only on empirical data, 
may lead to misleading results, especially if the data set is 
incomplete. Still, we believe, it has the potential of uncovering 
collective aspects which can hardly be derived 
in a theoretical bottom-up approach. 



APPENDIX A: MAXIMUM LIKELIHOOD DATA 
CLUSTERING 

Consider a set of N objects each of which is defined 
in terms of D measurable features, so that each object 
is represented by a vector $\vec\xi_i \in \mathbb{R}^D$, $i = 1, \ldots, N$. We 
assume for simplicity that data are normalized: $\vec\xi_i \cdot \vec e = 0$ 
where $\vec e = (1, 1, \ldots, 1)$ and $\|\vec\xi_i\|^2 = \vec\xi_i \cdot \vec\xi_i = 1$. 

In our case, when identifying sectors, the objects are 
assets and N = A, the number of assets. Their features 
are the daily returns in each day t and D = T. The t-th 
component of $\vec\xi_i$ is $x_i(t)/\sqrt{T}$. When identifying states 
instead, objects are days and features are assets (i.e. N = T 
and D = A). The i-th component of the vector for day t is $x_i(t)/\sqrt{A}$. 

The problem of classifying N objects into different 
classes goes under the name of data clustering. Naively 
one would like to have similar objects classified in the 
same cluster, but in practice one faces a number of problems: 
What does similar mean? What is the "right" 
number of clusters? Which principle should one follow? We resort 
to a recent data clustering technique based on the 
maximum likelihood principle and a simple statistical hypothesis: 
similar objects have something in common. In 
mathematical terms, we let $s_i$ be the label of the cluster 
to which object i belongs, and $A_s = \{i : s_i = s\}$ be the 
set of objects with $s_i = s$. We assume that 



$$ \vec\xi_i = g_{s_i}\, \vec\eta_{s_i} + \sqrt{1 - g_{s_i}^2}\; \vec\epsilon_i \qquad \text{(A1)} $$



^Only two other strings of the type 21x661 occurred in the 
process but the starting days were Fridays (90/04/27 and 
90/05/25) and not Mondays. Note furthermore that normalization 
removes the collective component of the dynamics 
and it ensures that crash days appear with the same weight 
as normal days in the analysis. 

Here $\vec\eta_s$ denotes the common component shared by all 
objects $i \in A_s$ and $g_s > 0$ weights the common component 
against the individual one $\vec\epsilon_i$. Eq. (A1) is the 
statistical hypothesis where $g_s$ and $s_i$ are the parameters 
to be fitted. Assuming further that both $\vec\eta_s$ and $\vec\epsilon_i$ 
are Gaussian vectors in $\mathbb{R}^D$, with zero average and unit 
variance ($E[\|\vec\eta_s\|^2] = E[\|\vec\epsilon_i\|^2] = 1$) makes it possible to 
compute the likelihood of the parameters $\mathcal{G} = \{g_s\}$ and 
$\mathcal{S} = \{s_i\}$ (see Ref. for details). The likelihood is 
maximal when 



$$ g_s = \sqrt{\max\left[0,\ \frac{c_s - n_s}{n_s^2 - n_s}\right]} \qquad \text{(A2)} $$



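where $n_s = |A_s|$ is the number of objects in cluster s and 

$$ c_s = \sum_{i,j \in A_s} \vec\xi_i \cdot \vec\xi_j $$

is the total correlation inside cluster s. The maximum 
log-likelihood per feature takes the form 

$$ \mathcal{L}_c = \frac{1}{2} \sum_s \max\left[0,\ \log\frac{n_s}{c_s} + (n_s - 1)\log\frac{n_s^2 - n_s}{n_s^2 - c_s}\right]. $$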

Note that a cluster with a single isolated object ($n_s = 
c_s = 1$), or a cluster of uncorrelated objects ($c_s = n_s$) 
gives a vanishing contribution to the log-likelihood. 
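
As a numerical illustration of the two limiting cases just mentioned, the following Python sketch evaluates the contribution of a single cluster to the log-likelihood from its total correlation $c_s$ and size $n_s$. The function name `cluster_loglik` is hypothetical. Treating every cluster with $c_s \le n_s$ as contributing zero is an assumption of this sketch, chosen to agree with $g_s = 0$ in Eq. (A2); the sketch also returns infinity for perfectly correlated objects, where the formula diverges.

```python
import numpy as np

def cluster_loglik(xi):
    """Log-likelihood contribution of one cluster.

    xi: array (n_s, D) whose rows are the unit-norm vectors of the
        objects in the cluster.
    """
    xi = np.asarray(xi, dtype=float)
    n = xi.shape[0]
    total = xi.sum(axis=0)
    c = float(total @ total)   # c_s = sum over i, j of xi_i . xi_j
    if n == 1 or c <= n:       # isolated or uncorrelated: no contribution
        return 0.0
    if c >= n * n:             # identical objects: the formula diverges
        return float("inf")
    return 0.5 * (np.log(n / c) + (n - 1) * np.log((n * n - n) / (n * n - c)))
```

For two unit vectors with scalar product 1/2 one has $n_s = 2$ and $c_s = 3$, and the contribution is $\frac{1}{2}\log\frac{4}{3}$.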

Several algorithms for finding an approximate maximum 
of $\mathcal{L}_c$ over the space of cluster structures $\mathcal{S}$ have 
been discussed in Ref. [16]. We used both hierarchical 
clustering and simulated annealing algorithms, which 
yield quite similar results (the codes are available on the 
Internet). 

Figures 1 and 2 are a graphical representation of the hierarchical 
clustering algorithm: It starts from N clusters 
composed of a single object and it produces a sequence 
of cluster structures. At each iteration, two clusters of 
the configuration with K clusters are merged so that the 
log-likelihood of the resulting configuration with K - 1 
clusters is maximal. This procedure starts with K = N 
and it stops with K = 1, when a single cluster is formed. 
The log-likelihood of the cluster structure is $\mathcal{L}_c = 0$ when 
K = N, it increases as K decreases and it reaches a maximum 
for an intermediate value of K. Then it decreases again 
and reaches $\mathcal{L}_c \approx 0$ when K = 1, because of data normalization. 
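
The merging procedure can be sketched in a few lines of Python. This is a naive illustration written for this text, not the code distributed by the authors: it scans all pairs at every step (cubic cost in N), assumes unit-norm input vectors with no exact duplicates, and assigns zero log-likelihood to clusters with $c_s \le n_s$. The names `loglik` and `hierarchical_merge` are hypothetical.

```python
import numpy as np

def loglik(total, n):
    """Contribution of a cluster with summed vector `total` and size n."""
    c = float(total @ total)
    if n == 1 or c <= n:
        return 0.0
    if c >= n * n:
        return float("inf")
    return 0.5 * (np.log(n / c) + (n - 1) * np.log((n * n - n) / (n * n - c)))

def hierarchical_merge(xi):
    """Greedy agglomeration from K = N down to K = 1.

    xi: array (N, D) of unit-norm rows. Returns a list of tuples
    (K, total log-likelihood, list of clusters) for every K.
    """
    clusters = [[i] for i in range(len(xi))]
    sums = [xi[i].copy() for i in range(len(xi))]
    history = [(len(clusters), 0.0, [list(c) for c in clusters])]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                merged = loglik(sums[a] + sums[b],
                                len(clusters[a]) + len(clusters[b]))
                # Change of the total log-likelihood if a and b are merged.
                gain = (merged - loglik(sums[a], len(clusters[a]))
                        - loglik(sums[b], len(clusters[b])))
                if best is None or gain > best[0]:
                    best = (gain, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        sums[a] = sums[a] + sums[b]
        del clusters[b]
        del sums[b]
        total = sum(loglik(s, len(c)) for s, c in zip(sums, clusters))
        history.append((len(clusters), total, [list(c) for c in clusters]))
    return history
```

Choosing the pair with the largest gain is the same as choosing the merge that maximizes the log-likelihood of the configuration with K - 1 clusters, since the contributions of all other clusters are unchanged.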

The graphs report the log-likelihood of each cluster on 
the y axis. The initial configuration corresponds to N 
points aligned on the x axis (zero log-likelihood). Each 
merge operation is represented graphically by a link between 
the merging clusters and the new cluster. Hence as 
the log-likelihood increases structures above the x axis 
start to form. Red links are merging steps which increase 
the log-likelihood. Blue links correspond to situations 
where the log-likelihood of the union of the clusters is 
larger than that of each part but it is smaller than their 
sum (hence the total log-likelihood decreases). Hence statistically 
relevant clusters appear as the large red structures 
in the plot. 